March 1, 2019

Goals

Present logic and tools for interactively plotting static data

In two weeks we'll discuss Shiny, where data can be dynamic

  • When to use interactivity
  • Work through examples with
    • Plotly
    • Crosstalk
  • Other packages, RBokeh, Highcharter
  • What's going on under the hood

Why Interactive Plots?

  • They look cool
  • They reduce decision-making
  • Exploratory data analysis, your own or the user's
    • Plots you publish have an argument, focus reader attention
  • Browser-based, easy to share, few compatibility issues
  • Expose the data
  • Increase the amount of information you can share cleanly
  • Brushing and linking
    • Draw connections between data
    • Uncover relationships
  • You may not need Shiny

Motivating example: Pizza to the Polls

Pizza to the polls
Delivered over 10,000 pizzas to 41 states
Data from 2018 midterm elections

The data is in the intro_to_interactive folder. Load it as below and view fields.

setwd("~/Documents/DSI/intro_to_interactive") # UPDATE path to intro_to_interactive
pizza = read.csv("pizza_midterms.csv")
names(pizza) # We're interested in a subset of these fields
##  [1] "X"                              "Status"                        
##  [3] "Timestamp"                      "Verified"                      
##  [5] "Cost"                           "Number_of_pizzas"              
##  [7] "Polling_place_address"          "Link_to_report_on_social_media"
##  [9] "Address"                        "City"                          
## [11] "State"                          "Estimated_wait_time"           
## [13] "lon"                            "lat"                           
## [15] "Timestamp_date"                 "Timestamp_hour"

We introduce some concepts using the Texas housing (txhousing) data that's packaged with ggplot2 (plotly attaches ggplot2 when it is loaded). This data set has information on median housing prices and other sales information (volume, listings, inventory) over time for cities in Texas.

Pizza data

head(pizza, 3)
##   X    Status        Timestamp         Verified    Cost Number_of_pizzas
## 1 1 Delivered 11/06/18 1:08 PM              Yes $246.95               14
## 2 2       New 11/06/18 3:15 PM                                        NA
## 3 3 Delivered 11/06/18 8:44 AM \U0001f916 found $180.57               10
##                      Polling_place_address
## 1 1 Brookings Dr, St. Louis, MO 63130, USA
## 2 1 Brookings Dr, St. Louis, MO 63130, USA
## 3       1 Hall Dr, Richmond, KY 40475, USA
##                                 Link_to_report_on_social_media   Address
## 1 https://twitter.com/mfriedmanmets/status/1059857762098864128          
## 2                                                                       
## 3     https://www.facebook.com/WBONTV/videos/1849851705112199/ 1 Hall Dr
##        City State Estimated_wait_time       lon      lat Timestamp_date
## 1 St. Louis    MO                     -90.30505 38.64820     2018-11-06
## 2 St. Louis    MO                     -90.30505 38.64820     2018-11-06
## 3  Richmond    KY                     -84.30428 37.73337     2018-11-06
##   Timestamp_hour
## 1             13
## 2             15
## 3              8
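
Fields such as Cost and Timestamp arrive as text, so they need converting before they can be plotted. A minimal base-R sketch using values copied from the rows above. The helper names parse_cost and hour_of are hypothetical, and the timestamp format string is an assumption based on how the values print (it also assumes an English locale for AM/PM):

```r
# Hypothetical helpers; not part of the original analysis.
parse_cost <- function(x) as.numeric(gsub("[$,]", "", x))

hour_of <- function(x) {
  # Assumes month/day/two-digit year and a 12-hour clock with AM/PM.
  t <- as.POSIXct(x, format = "%m/%d/%y %I:%M %p", tz = "UTC")
  as.integer(format(t, "%H"))
}

cost  <- parse_cost(c("$246.95", "", "$180.57"))  # the empty string becomes NA
hours <- hour_of(c("11/06/18 1:08 PM", "11/06/18 3:15 PM", "11/06/18 8:44 AM"))
```

The hours vector can be compared against the Timestamp_hour column as a sanity check.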

Example - Static Plot

Here is a static plot built in ggplot. What is it lacking?

How to visualize this data?

That depends on what we're interested in showing.

Specific questions:

  • Where were pizzas delivered? <- our focus
  • When were pizzas delivered? Relative to poll-closing times?

Contextual question:

  • What describes places where people wait on lines?

Some hypotheses:

  • Longer lines in cities, densely populated areas
  • Longer lines in states with close elections
  • Longer lines in states without vote-by-mail
  • More deliveries in Democratic-leaning locations (Pizza to the Polls network)
  • Influence of vote suppression, voter ID laws

Deep exploration of the data is helpful in refining hypotheses.
We expect these effects to be non-uniform, which points toward complex models.

As of January 30, 2018, Colorado, Oregon, and Washington conducted all elections using a vote-by-mail system (via Ballotpedia).

# NOT the elections package on cran - https://github.com/MEDSL/elections/blob/master/README.md
# make sure devtools is updated
# if (!require('devtools', quietly = TRUE)) install.packages('devtools')
# devtools::install_github('MEDSL/elections') 

# The package makes available the following datasets:

# presidential_precincts_2016
# senate_precincts_2016
# house_precincts_2016
# state_precincts_2016
# local_precincts_2016

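library(elections)
library(dplyr)    # %>%, group_by(), summarize(), left_join(), filter()
library(ggplot2)  # ggplot(), geom_sf(), geom_point()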

# Show percent dem by state for 2016 presidential election:

data("presidential_precincts_2016"); head(presidential_precincts_2016)

pres_by_state_returns_2016 = presidential_precincts_2016 %>% 
  group_by(state_postal, party) %>%
  summarize(party_votes = sum(votes))

# [Dd] matches either case; the earlier [D|d] also matched a literal "|"
state_2party = left_join(pres_by_state_returns_2016 %>% filter(grepl(party, pattern="[Dd]emocrat")),
                         pres_by_state_returns_2016 %>% filter(grepl(party, pattern="[Rr]epublican")), by = "state_postal")

state_2party = state_2party %>% 
  group_by(state_postal) %>% 
  summarize(votes.dem = sum(party_votes.x), votes.rep = sum(party_votes.y))

us.state.dem = left_join(us.state, state_2party, by = c("state_abbr" =  "state_postal")) %>% 
  mutate(percent_dem = votes.dem/(votes.dem + votes.rep))


plot3 = ggplot(data = pizza.grouped, aes(x = lon, y = lat, size = Pizzas_delivered)) + 
          ylim(24,75) + xlim(-175, -67) +
          geom_sf(data = us.state.dem, aes(fill =  percent_dem), inherit.aes = FALSE) +
          scale_fill_gradient(low = "white", high = "black", limits = c(0,1)) + 
          geom_point(color = "red", alpha = .5) + 
          ggtitle("Delivery locations (By unique polling places -- some overlap)")

plot3

# Show percent dem by congressional district for 2016 presidential election:

data("house_precincts_2016"); head(house_precincts_2016)


head(us_congressional())

# lm ----
#polling places listed by state
View(pizza.grouped %>% group_by(state) %>% summarize(n()))
View(filter(pizza.grouped, Status == "Delivered") %>% group_by(state) %>% summarize(n()))
View(filter(pizza.grouped, Status == "Delivered") %>% group_by(state) %>% summarize(sum(Pizzas_delivered)))

pizza.grouped.lm = left_join(pizza.grouped, us.state.dem[,c("state_abbr", "percent_dem")], by = c("state"= "state_abbr")) %>% ungroup() %>% filter(!is.na(percent_dem), Status == "Delivered")
lm1 = lm(Pizzas_delivered ~ percent_dem + state, data = pizza.grouped.lm)
summary(lm1)
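
The two-party share used above (percent_dem) can be checked on a toy table before running it on the full precinct data. This is a sketch under stated assumptions: the states, party labels, and vote counts are made up, and it labels each row with a side instead of joining two filtered tables, which avoids duplicated rows when several party labels match the same pattern.

```r
library(dplyr)

# Toy returns; states, labels, and counts are invented for illustration.
toy_returns <- data.frame(
  state_postal = c("AA", "AA", "AA", "BB", "BB"),
  party        = c("democratic", "republican", "libertarian", "Democrat", "republican"),
  votes        = c(60, 40, 5, 30, 70)
)

toy_share <- toy_returns %>%
  mutate(side = case_when(
    grepl("democrat",   party, ignore.case = TRUE) ~ "dem",
    grepl("republican", party, ignore.case = TRUE) ~ "rep"
  )) %>%
  filter(!is.na(side)) %>%                    # drop third parties
  group_by(state_postal) %>%
  summarize(votes.dem = sum(votes[side == "dem"]),
            votes.rep = sum(votes[side == "rep"])) %>%
  mutate(percent_dem = votes.dem / (votes.dem + votes.rep))
```

Here AA works out to 60 / (60 + 40) and BB to 30 / (30 + 70); the libertarian row does not enter the denominator.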

Interactive Plotting with Plotly

Example - Texas housing data

Enter this code in the console:

names(txhousing)

tx.ggplot = ggplot(txhousing) + 
              geom_line(aes(x = interaction(month, year), y = median,
                            group = city, color = city)) +
              theme(axis.text.x = element_text(angle = 90))

tx.ggplotly = ggplotly(tx.ggplot, tooltip = c("x", "median", "group")) 


tx.ggplotly
  • The ggplotly function easily adds interactivity to ggplot

  • We can refer to "x" and "y" in the tooltip either as x and y or by their assigned variable names. (A "city" tooltip would be duplicated.)
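
A smaller version of the same idea, as a sketch: one city, one line, and a tooltip restricted to the y value. The subset and object names are illustrative only.

```r
library(plotly)
library(ggplot2)  # provides txhousing

houston <- subset(txhousing, city == "Houston")

p <- ggplot(houston, aes(x = date, y = median)) +
  geom_line()

# Show only the y aesthetic on hover; "y" and "median" name the same mapping here.
w <- ggplotly(p, tooltip = "y")
```

Printing w in the console opens the interactive version in the viewer.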

Structure of plots in Plotly

  • Like ggplot, plots are built in layers
    • ggplotly translates a ggplot layer into one or more plotly.js traces
    • Every trace has a type, and the default is "scatter"
    • instead of aes we use ~
    • There is default behavior for how to translate certain layer types, but you can usually customize it.
  • Data-plot-pipeline
    • Plots have data attached
  • View plot structure (https://plotly-book.cpsievert.me/extending-ggplotly.html#modifying-layers)
    • This is very helpful for diagnosing problems
    • For example, it helped me figure out where legends were being turned off

Three ways to build plots in Plotly

  1. Build the plot in ggplot and convert it to plotly with ggplotly
  2. Start the plot in ggplot, convert and add elements in plotly
     • Useful for doing things in the easier syntax
  3. Build the whole thing in plotly (plot_ly)

"The initial inspiration for the plot_ly() function was to support plotly.js chart types that ggplot2 doesn’t support…This newer “non-ggplot2” interface to plotly.js is currently not, and may never be, as fully featured as ggplot2."

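The three routes listed above, sketched on a single city. This is not the content of options_to_build_plotly.R; the subset, title, and object names are placeholders.

```r
library(plotly)
library(ggplot2)  # provides txhousing

houston <- subset(txhousing, city == "Houston")
gg <- ggplot(houston, aes(x = date, y = median)) + geom_line()

# 1. Build in ggplot, convert with ggplotly
way1 <- ggplotly(gg)

# 2. Convert, then keep adding elements with plotly functions
way2 <- ggplotly(gg) %>%
  layout(title = "Houston median sale price")

# 3. Build the whole thing with plot_ly
way3 <- plot_ly(houston, x = ~date, y = ~median) %>%
  add_lines()
```
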
Example - Three ways to build plotly

Open options_to_build_plotly.R

Exercise - Build a plotly in three ways

Customizing plotly output

#color/colors, symbol/symbols, linetype/linetypes, size/sizes - These arguments are unique to the R package 

#layout() function used earlier

#Editing traces after plotly conversion
#style() function

#plotly_json(tx.ggplotly)
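
A short sketch of the functions named above on a two-city plot. The hex colors, title, and line width are arbitrary choices for illustration.

```r
library(plotly)
library(ggplot2)  # provides txhousing

two_cities <- subset(txhousing, city %in% c("Austin", "Houston"))

# color maps a variable to color; colors supplies the palette to draw from
p <- plot_ly(two_cities, x = ~date, y = ~median,
             color = ~city, colors = c("#1f77b4", "#d62728")) %>%
  add_lines()

# layout() edits titles and axes; style() edits traces after they exist
p_custom <- p %>%
  layout(title = "Median sale price", yaxis = list(title = "USD")) %>%
  style(line = list(width = 3), traces = 1)   # thicken the first trace only
```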

The data-plot-pipeline

tx.ggplotly2 = 
  tx.ggplotly %>%
  group_by(interaction(month, year)) %>%
  summarize(overall_med = median(median, na.rm = TRUE)) %>% 
  add_lines(y = ~overall_med, color = I("black"), size = I(3), name = "overall_med")

tx.ggplotly2

plotly_json(tx.ggplotly2)
plotly_data(tx.ggplotly2, id = 3)
names(plotly_data(tx.ggplotly2, id = 2)) 

Branch off with add_fun. Or call external function with city name as argument.

tx.ggplotly %>%
  add_fun(function(plot) {
    plot %>% ungroup() %>% filter(city == "Houston") %>%
      add_lines(y = ~median, name = "Houston", color = I("black"))
  }) %>%
  add_fun(function(plot) {
    plot %>% ungroup() %>% filter(city == "San Antonio") %>%
      add_lines(y = ~median, name = "San Antonio", linetype = I(3))
  }) 

Rangeslider

The rangeslider function lets us control the visible time range of the data.

widgetframe::frameWidget(
  tx.ggplotly2 %>% rangeslider(start = 1, end = length(tx.ggplotly2$x$data[[1]]$x))
)
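
A standalone sketch of rangeslider on a plot built with plot_ly. The date column in txhousing is a decimal year, so the start and end values below are years; the 2005 to 2010 window is an arbitrary choice.

```r
library(plotly)
library(ggplot2)  # provides txhousing

houston <- subset(txhousing, city == "Houston")

p_slider <- plot_ly(houston, x = ~date, y = ~median) %>%
  add_lines() %>%
  rangeslider(start = 2005, end = 2010)  # initial visible window on the x axis
```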

Example - Interactive pizza plot

What is still lacking?

Linked plotting

sd <- SharedData$new(txhousing, ~city, "Select a city")

base <- plot_ly(sd, color = I("black"), height = 400) %>% group_by(city)

p1 <- base %>%
  summarise(miss = sum(is.na(median))) %>% 
  filter(miss > 0) %>%
  add_markers(x = ~miss, y = ~fct_reorder(city, miss), hoverinfo = "x+y") %>%
  layout(barmode = "overlay", xaxis = list(title = "Number of months missing")) 

p2 <- base %>% add_lines(x = ~date, y = ~median, alpha = 0.3)

subplot(p1, p2, titleX = TRUE, widths = c(0.3, 0.7)) %>% hide_legend() %>%
   highlight(dynamic = TRUE, selectize = TRUE)

Arranging linked plots

  • subplot in plotly
    • A plotly subplot is a single plotly graph with multiple traces anchored on different axes
    • subplot returns a plotly object, so it can be applied recursively (can have subplots within subplots)
    • Under the hood, plotly implements ggplot2 facets (e.g. from facet_wrap) as subplots, which enables the synchronized zoom events on shared axes
    • plotly-book.cpsievert.me/merging-plotly-objects.html
  • bscols in crosstalk
  • tagList, tags$div in htmltools
  • fluidPage from Shiny
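
The first two options can be compared side by side. A sketch, assuming a three-city subset and a SharedData object keyed on city; the object names are illustrative.

```r
library(plotly)
library(ggplot2)    # provides txhousing
library(crosstalk)

three_cities <- subset(txhousing, city %in% c("Austin", "Dallas", "Houston"))
shared <- SharedData$new(three_cities, key = ~city)

p_price <- plot_ly(shared, x = ~date, y = ~median) %>% group_by(city) %>% add_lines()
p_sales <- plot_ly(shared, x = ~date, y = ~sales)  %>% group_by(city) %>% add_lines()

# Option A: one plotly object with two panels that share the x axis
stacked <- subplot(p_price, p_sales, nrows = 2, shareX = TRUE)

# Option B: two separate widgets laid out in a 12-column grid
side_by_side <- bscols(widths = c(6, 6), p_price, p_sales)
```

subplot returns a plotly object, so functions such as highlight() or layout() can still be applied to it; bscols returns HTML that wraps the widgets.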

Other widgets - datatable example

  • The DT package is an interface to the JavaScript library DataTables. The function datatable() creates an HTML widget as shown below.

  • The widgetframe function frameWidget is used to embed the datatable output in this presentation.

  • Other crosstalk-compatible widgets include d3scatter, leaflet (maps), summarywidget, rgl (3D). (See https://rstudio.github.io/crosstalk/widgets.html)

widgetframe::frameWidget(
  DT::datatable(txhousing, options = list(pageLength = 3))
)
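
Because DT is crosstalk-compatible, a datatable can also be linked to a plot by handing both the same SharedData object. A sketch with an arbitrary subset (two cities, one year) and illustrative object names:

```r
library(plotly)
library(ggplot2)    # provides txhousing
library(crosstalk)

recent <- subset(txhousing, city %in% c("Austin", "Houston") & year == 2010)
shared <- SharedData$new(recent)   # default key: row names

scatter <- plot_ly(shared, x = ~listings, y = ~median) %>% add_markers()
tbl     <- DT::datatable(shared, options = list(pageLength = 3))

linked <- bscols(scatter, tbl)
```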

Linking with plotly, pizza example

Review - Advantages and Limitations

  • Newer code, still under development. Unexpected limitations and bugs.
  • Related Packages - Bokeh (rbokeh), Highcharts (highcharter), rCharts, and more

Related Packages - Bokeh

Related Packages - Highcharter

Comparison of heatmaps

Create a correlation matrix of median housing prices of Texas cities and show default heatmaps for each package.

txhousing.heatmap.data = txhousing %>%
  mutate(time = as.character(interaction(year, month))) %>%
  select("city", "time", "median") %>%
  dcast(time ~ city, na.rm = T, value.var = "median") %>%
  select(-c("time")) %>%
  cor(use = "pairwise.complete.obs")

# base
image(as.matrix(txhousing.heatmap.data))
# ggplot
ggplot(melt(txhousing.heatmap.data)) + 
          geom_tile(aes(x = Var1, y = Var2, fill = value))
# plotly
ggplotly(ggplot(melt(txhousing.heatmap.data)) + 
          geom_tile(aes(x = Var1, y = Var2, fill = value)))
# rbokeh
figure(width = 600) %>% ly_crect(data = melt(txhousing.heatmap.data), 
          x = Var1, y = Var2, color = value)
# highcharter
hchart(melt(txhousing.heatmap.data), "heatmap", 
          hcaes(x = Var1, y = Var2, value = value))

Part II - Under the hood (Duncan)

  • General:
  • What is plotly (and related) actually doing – how is R code translated to web (js, json, html, css)
  • How events trigger code on the server (https://plotly-book.cpsievert.me/linking-views-with-shiny.html)
  • Comparisons of interactive plotting packages (plotly, bokeh, highcharter) in terms of how they interact with JavaScript. Important differences for performance?
  • Comparing backends (html, server, server running R), pros and cons?
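
One way to start on the first question is to look at what plotly hands to the browser. A sketch, assuming a one-city plot: plotly_build() exposes the R list behind the widget, and plotly_json(jsonedit = FALSE) returns the same content as a JSON string rather than opening the interactive viewer.

```r
library(plotly)
library(ggplot2)  # provides txhousing

houston <- subset(txhousing, city == "Houston")
p <- plot_ly(houston, x = ~date, y = ~median) %>% add_lines()

spec <- plotly_build(p)$x          # R list that is serialized for plotly.js
spec_names <- names(spec)          # top-level pieces of the specification

json <- plotly_json(p, jsonedit = FALSE)   # the same content as JSON text
json_head <- substr(json, 1, 200)          # peek at the start of the string
```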

  • Building a custom htmlwidget?
  • Specific limitations I've encountered:
  • Adding a main title with bscols
  • Converting a dotplot to plotly
  • Weird double legend (doesn't appear in workspace) and controlling legend symbol (see slide: Example - Interactive pizza plot)
  • The plotly_json function has solved some of my issues – allows easy navigation of plot json